Advanced Analytics with R (UG 21-24)
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at the Gokhale Institute of Politics and Economics.
I am an RStudio (Posit) certified tidyverse Instructor.
I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
To better understand model assessment and selection.
We will learn resampling methods to help us achieve this objective.
Essentially, we fit a model over and over using different subsets of training data.
We will learn about Cross-Validation and the Bootstrap.
We know that the training error rate tends to underestimate the test error rate.
So we use some techniques to estimate the test error rate.
We will carve out a subset from the training data. This carved-out subset will not be used in the fitting process. Then we use this carved-out subset to evaluate the fitted model.
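The carve-out-and-evaluate idea above can be sketched in base R. This is a minimal sketch using the built-in mtcars data as a stand-in for the Auto data used later:

```r
set.seed(1)  # for a reproducible split

# Randomly carve out half of the rows as the validation set
n <- nrow(mtcars)
train_idx <- sample(n, size = n %/% 2)
train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Fit on the training half only
fit <- lm(mpg ~ hp, data = train)

# Use the held-out half to estimate the test error (MSE)
validation_mse <- mean((valid$mpg - predict(fit, valid))^2)
validation_mse
```

A different seed gives a different split, and therefore a different validation MSE; that variability is exactly what the exercise below explores.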
Recall
Call:
lm(formula = mpg ~ horsepower, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Call:
lm(formula = mpg ~ horsepower + poly(horsepower, 2), data = Auto)
Residuals:
Min 1Q Median 3Q Max
-14.7135 -2.5943 -0.0859 2.2868 15.8961
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.639714 62.43 <2e-16 ***
horsepower -0.157845 0.005747 -27.47 <2e-16 ***
poly(horsepower, 2)1 NA NA NA NA
poly(horsepower, 2)2 44.089528 4.373921 10.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
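A side note on the "(1 not defined because of singularities)" line in the second summary above: the first column of poly(horsepower, 2) is just a rescaled copy of horsepower, so when both appear in the formula, lm() drops the redundant coefficient as NA. A small sketch with the built-in mtcars data shows the same behaviour:

```r
# hp plays the role of horsepower here.
# poly(hp, 2)1 is collinear with hp, so lm() reports it as NA.
fit <- lm(mpg ~ hp + poly(hp, 2), data = mtcars)
coef(fit)
```

The usual ISLR formulation avoids this by using mpg ~ poly(horsepower, 2) alone, without the raw horsepower term.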
calc_validation_mse <- function(pow) {
  # Randomly assign each row to the training (1) or validation (0) half
  Auto_marked <- Auto |>
    mutate(
      set_train = sample(c(1, 0), nrow(Auto),
                         replace = TRUE,
                         prob = c(0.5, 0.5))
    )
  Auto_train <- Auto_marked[Auto_marked$set_train == 1, ]
  Auto_valid <- Auto_marked[Auto_marked$set_train == 0, ]
  # Fit a degree-pow polynomial on the training half only
  mod <- lm(mpg ~ poly(horsepower, pow),
            data = Auto_train)
  tibble(
    power = pow,
    validation_mse = mean((Auto_valid$mpg - predict(mod, Auto_valid))^2)
  )
}
map_dfr(
  1:10,
  calc_validation_mse
) |>
  ggplot(aes(power, validation_mse)) +
  geom_point(colour = "steelblue") +
  scale_x_continuous(breaks = 1:10) +
  geom_line(colour = "red") +
  theme_bw() +
  labs(
    y = "MSE",
    x = "Degree of polynomial"
  ) -> plot_powers
Problems with the validation set approach
Use the penguins data from the palmerpenguins package.
Split the data into a 50/50 training and validation set.
Use species as the response. Train a multinomial logistic regression to predict species.
Write code that records the validation error rate.
Iterate this 100 times and plot a density chart of the validation error rates.
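One possible sketch of a single iteration, assuming the palmerpenguins and nnet packages are installed (nnet::multinom fits the multinomial logistic regression; the predictor set chosen here is an illustrative assumption, not part of the exercise statement):

```r
library(palmerpenguins)
library(nnet)

set.seed(42)
peng <- na.omit(penguins)  # drop rows with missing values

# 50/50 training/validation split
idx <- sample(nrow(peng), size = nrow(peng) %/% 2)
train <- peng[idx, ]
valid <- peng[-idx, ]

# Multinomial logistic regression with species as the response
mod <- multinom(species ~ bill_length_mm + bill_depth_mm +
                  flipper_length_mm + body_mass_g,
                data = train, trace = FALSE)

# Validation error rate: share of misclassified penguins
pred <- predict(mod, newdata = valid)
error_rate <- mean(pred != valid$species)
error_rate
```

To finish the exercise, wrap the steps above in a function, call it 100 times (for example with map_dfr), and pass the collected error rates to ggplot with geom_density.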
Less bias compared to the validation set approach, so it does not overestimate the test error rate as much.
Since there is no randomness in the training/validation splits, there is very little variability in the test error rate estimate.
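These two properties describe leave-one-out cross-validation (LOOCV). A minimal sketch with boot::cv.glm (the boot package ships with R), again using mtcars as a stand-in for the Auto data:

```r
library(boot)  # provides cv.glm for K-fold and leave-one-out CV

# A linear fit of mpg on hp; glm() with the default gaussian family
# is equivalent to lm(), but works with cv.glm()
glm_fit <- glm(mpg ~ hp, data = mtcars)

# K defaults to n, i.e. leave-one-out CV: every row is held out once,
# so the result is deterministic -- no random split involved
loocv <- cv.glm(mtcars, glm_fit)

loocv$delta[1]  # raw LOOCV estimate of the test MSE
```

Because each observation is left out exactly once, rerunning the call gives the same estimate every time, unlike the validation set approach above.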